Join two sheets, groupby and sum on the joined data


In [53]:
import pandas as pd

In [75]:
# load both sheets as new dataframes
shows_df = pd.read_csv("show_category.csv")
views_df = pd.read_excel("views.xls")

Change the index to be the showname


In [76]:
shows_df.head()


Out[76]:
showname category
0 firefly scifi
1 hollyoaks soap
2 battlestar scifi

In [77]:
shows_df = shows_df.set_index('showname')
shows_df.head()


Out[77]:
category
showname
firefly scifi
hollyoaks soap
battlestar scifi

Do the same for views

DANGER note that battle-star has a hyphen!


In [78]:
views_df.head()


Out[78]:
viewer_id show_watched views
0 23 firefly 2
1 29 hollyoaks 4
2 22 firefly 100
3 22 battle-star 10
4 22 FIREFLY 200

In [79]:
views_df = views_df.set_index('viewer_id')
views_df.head() # note that we can have repeating viewer_id values (they're non-unique)


Out[79]:
show_watched views
viewer_id
23 firefly 2
29 hollyoaks 4
22 firefly 100
22 battle-star 10
22 FIREFLY 200

In [80]:
# we can select out the column to work on, then use the built-in str (string) functions
# to replace hyphens (we do this and just print the result to screen)
views_df['show_watched'].str.replace("-", "")


Out[80]:
viewer_id
23       firefly
29     hollyoaks
22       firefly
22    battlestar
22       FIREFLY
22      FIREFLY 
Name: show_watched, dtype: object

In [81]:
# now we do the fix in-place
views_df['show_watched'] = views_df['show_watched'].str.replace("-", "")
# NOTE if you comment out the line above, you'll get a NaN in the final table
# as `battle-star` won't be joined
views_df


Out[81]:
show_watched views
viewer_id
23 firefly 2
29 hollyoaks 4
22 firefly 100
22 battlestar 10
22 FIREFLY 200
22 FIREFLY 300

In [82]:
print("Index info:", views_df.index)
views_df.ix[22] # select the items with index 22 (note this is an integer, not string value)


Index info: Int64Index([23, 29, 22, 22, 22, 22], dtype='int64', name='viewer_id')
Out[82]:
show_watched views
viewer_id
22 firefly 100
22 battlestar 10
22 FIREFLY 200
22 FIREFLY 300

Join on shows watched against category

PROBLEM - we have NaN values for the last two Firefly entries. Can you do something, earlier on, to fix this, from here inside the Notebook?


In [83]:
shows_views_df = views_df.join(shows_df, on='show_watched')
shows_views_df


Out[83]:
show_watched views category
viewer_id
23 firefly 2 scifi
29 hollyoaks 4 soap
22 firefly 100 scifi
22 battlestar 10 scifi
22 FIREFLY 200 NaN
22 FIREFLY 300 NaN

In [84]:
# take out two relevant columns, group by category, sum the views
shows_views_df[['views', 'category']].groupby('category').sum()


Out[84]:
views
category
scifi 112
soap 4